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Abstract 

RNA-binding proteins (RBPs) bind to tlieir target RNA molecules by recognizing specific RNA sequences and structural 
contexts. The development of CLIP-seq and related protocols has made it possible to exhaustively identify RNA 
fragments that bind to RBPs. However, no efficient bioinformatics method exists to reveal the structural specificities of 
RBP-RNA interactions using these data. We present CapR, an efficient algorithm that calculates the probability that 
each RNA base position is located within each secondary structural context. Using CapR, we demonstrate that several 
RBPs bind to their target RNA molecules under specific structural contexts. CapR is available at https://sites.google. 
com/site/fukunagatsu/software/capr. 



Background 

RNA-binding proteins (RBPs) play integral roles in var- 
ious post-transcriptional regulatory processes, including 
the splicing, processing, localization, degradation and 
translation of RNA molecules [1]. RBPs typically con- 
tain a limited set of RNA-binding domains, such as the 
RNA recognition motif and K homology domain, and 
they must bind to specific RNA molecules to function. 
The human genome contains more than 400 annotated 
RBPs [2]. Although most of these RBPs are still poorly 
characterized, it is known that the dysfunction of certain 
RBPs causes severe diseases, such as neurodegenerative 
disorders, heart failure and cancers [3,4]. RBP-RNA inter- 
actions and their specificities are important for under- 
standing the complex gene regulatory networks and the 
mechanisms of human diseases. 

Recent advances in 'ribonomic' technologies, such 
as cross-linking immunoprecipitation high-throughput 
sequencing (CLIP-seq, also referred to as HITS -CLIP) 
[5], individual-nucleotide resolution CLIP (iCLIP) [6], 
and photoactivatable-ribonucleoside-enhanced CLIP 
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(PAR-CLIP) [7], have enabled the study of RBP-RNA 
interactions, both on a genomic scale and at high 
resolution. The use of microarrays in the classical 
RNA-binding protein immunoprecipitation microarray 
(RIP-Chip) method [8] prevented the precise identifi- 
cation of binding sites. In contrast, CLIP-seq methods 
bond an RBP and RNAs covalently by ultraviolet cross- 
linking, collect them by immunoprecipitation and directly 
sequence the RBP-bound sites of the RNAs. Using these 
technologies, researchers can identify sequential RNA 
motifs that are over-represented around the binding 
sites of each RBP using bioinformatics methods sim- 
ilar to those used for analyzing transcription-factor 
binding DNA motifs [9]. Such sequential motifs are 
often very short (up to ten bases), and there are many 
unbound sites that have the same motif. Thus, sequential 
motifs alone cannot explain the specificity of RBP-RNA 
interactions. 

RBPs bind to their target RNA molecules by recogniz- 
ing specific RNA sequences and their structures. Sev- 
eral studies have addressed this issue by calculating the 
accessibility of RNA regions around the RBP-binding 
sites [10]. Here, the accessibility of an RNA region is 
defined by the probability that the region exhibits a single- 
stranded conformation. Theoretically, the accessibility 
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can be efficiently and exactly calculated using an energy 
model of RNA secondary structures [11,12]. Double- 
helical RNAs usually form the A-form helical structure, 
whose major grooves are too narrow to be accessed by 
RBPs [13], and Li etal. showed that the accessibilities tend 
to be high around the RBP-bound motif sites by analyzing 
RIP-Chip data [10]. However, it is not sufficient to con- 
sider accessibility alone in analyzing the structure-specific 
target recognition by RBPs. For example, Vtslp, which is 
a yeast RBP regulating mRNA stability, binds to its target 
CNGG sequential motif when it is located within hair- 
pin loops but not when it is located in single-stranded 
regions or other structures [14,15]. The human FET family 
of proteins, whose mutations are associated with amy- 
otrophic lateral sclerosis, bind to its target sequential 
UAN^^Y motif within hairpin loops [16]. Computational 
methods for calculating the secondary structural contexts 
of RNA molecules, such as bulge loops, hairpin loops and 
stems, are required to uncover the characteristics of the 
RNA structures that are recognized by the RBPs in vivo. 

In the present study, we developed an efficient algorithm 
that calculates the probabilities that each RNA base posi- 
tion is located within each secondary structural context. 
Six contexts of RNA secondary structures were taken into 
account, according to the well-established Turner energy 
model of RNAs [17]. These structures included stems (S), 
hairpin loops (H), bulge loops (B), internal loops (I), multi- 
branch loops (M) and exterior loops (E) (see Figure 1). 
We defined a structural profile of an RNA base as a set of 
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Figure 1 Visual representation of tlie six structural contexts. The 

six structural contexts are represented by six colors: stems (red), 
exterior loops (light green), hairpin loops (purple), bulge loops (pink), 
internal loops (blue) and multibranch loops (green). The unstructured 
context is the union of the exterior and multibranch loops. These 
colors are used throughout the paper. 



six probabilities that the base belongs to each context. At 
present, Sfold [18] is the only software that can calculate 
a structural profile. Sfold cannot be readily applied to 
tens of thousands RNA fragments because it uses a sta- 
tistical sampling method that requires huge sample sizes 
and computational costs, particularly when analyzing long 
RNAs or mRNAs. We implemented our efficient algo- 
rithm as software named 'CapR', which can compute the 
structural profiles for tens of thousands of long RNAs 
within a reasonable time by enumerating all the possible 
secondary structures of the RNAs. 

Results 

Methods overview 

We have developed a new algorithm that calculates the 
structural profiles of any RNA sequence based on the 
Turner energy model with time complexity 0{NW^) [17]. 
Here, N is the input sequence length and W is the 
maximal span, which is a given parameter of the maxi- 
mal length between the bases that form base pairs. The 
parameter W was introduced because considering very 
long interactions does not improve the accuracy of the 
secondary structure predictions but does increase the 
computational costs [19]. 

Let X be an RNA sequence of length N and a be a 
possible secondary structure on x without pseudoknots. 
We refer to a base in x as stem if it forms a base pair 
with another base, and represent it using the character 
S. Single-stranded bases are categorized into five struc- 
tural contexts, namely, bulge loop (represented by B), 
exterior loop (E), hairpin loop (H), internal loop (I) and 
multibranch loop (M), which are defined as follows. In a 
secondary structure representation, RNA bases are ver- 
tices of polygons whose edges are the RNA backbone 
or hydrogen bonds, which are shown as solid or dotted 
lines, respectively, in Figure 1. The exterior loop context 
is given to single-stranded bases if they do not form poly- 
gons. The hairpin loop context is given to single-stranded 
bases if they form a polygon that has a single hydrogen 
bond. The bulge and internal loop contexts are given to 
single-stranded bases if they form a polygon that has two 
hydrogen bonds, which are connected by a single back- 
bone edge for bulge loops and which are not connected 
by a single backbone edge for internal loops. Finally, the 
multibranch loop context is given to single-stranded bases 
if they form a polygon that has more than two hydro- 
gen bonds. Note that for a given secondary structure a, 
any base of x is unambiguously classified as one of the six 
structural contexts. Additionally, we define unstructured 
(U) to represent collectively the exterior and multibranch 
loop contexts. 

We assume that the probability distribution of the sec- 
ondary structures follows the Boltzmann distribution with 
respect to the Turner energy model [17]. The probability 
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p(i, 8) that a base at position / has the structural context 
8 e {B, £, //, 7, M, S} is given by 

p(ij)=^ J2 expi-AG(a.x)/RT) 
^^^^ aeQii,8) 

Z(x) = J2 exp(-AG(or,;c)/7?r) 
aeQo 

where AG(a,x) is the difference of the Gibbs energies of 
the given structure a and the structure ctq that contains 
no base pairs, R is the gas constant and T is the temper- 
ature (we used T = 310.15 K in this study). Qq is the set 
of all the possible secondary structures of x, and Q (/, 8) is 
the set of all the possible secondary structures in which 
the base at position / is in the structural context 8. Then, 
the structural profile of / is defined as the probabilities 
of the structural contexts {p(iy8)\8 e {B,E,HJ,M,S}}. 
Note that the structural profile satisfies the probability 
condition ^^p{iy 8) = 1. 

Our algorithm efficiently calculates structural profiles 
by referring to the Rfold model, which is a variant of the 
stochastic context-free grammar (SCFG) that calculates 
all the RNA secondary structures without redundancy 
[20]. In formal language theory, the RNA secondary struc- 
tures without pseudoknots are modeled by SCFG [21]. 
While the state transition rules of the Rfold model contain 
seven non-terminal symbols, our algorithm associated 
them with the six structural contexts. The details of the 
algorithm, which is a variant of the inside-outside algo- 
rithm of SCFG, are given in the Materials and methods 
section. 

Influence of the maximal span and the GC content on the 
structural profile calculations 

Before we investigated the structure-specific target recog- 
nition by RBPs, we evaluated the performance of CapR. 
Because we introduced the maximal span W, we needed 
to investigate an appropriate range for this parameter. 
Because GC content is known to affect the RNA sec- 
ondary structures, its effect was also analyzed. 

To investigate the dependence on the maximal span 
W, we applied CapR to 1,000 random RNA sequences of 
2,000 nucleotides with a fixed GC content (GC = 0.5). 
Figure 2A shows how the proportions of the calculated 
structural profiles depend on W, As expected, if W is 
small, the predictions are dominated by exterior loops 
because few bases form base pairs under this condition. 
Whereas the probabilities for bulge loops, hairpin loops, 
internal loops and stems are relatively stable for W > 100, 
the exterior loop probabilities monotonically decrease and 
the multibranch loop probabilities monotonically increase 
with increasing W, This is because at large W new base 
pairs form in exterior loops and exterior loops turn into 
multibranch loops. On the other hand, the probabilities 



of the unstructured context, which collectively represents 
the exterior and multibranch loop contexts, are insensitive 
to W (Additional file 1: Figure SI). Therefore, the unstruc- 
tured context can be adopted instead of the exterior and 
multibranch loop contexts to avoid the influence of the 
parameter Wy if a discrimination of the two contexts is not 
critical. 

Although Kiryu et al revealed the dependence of the 
accessibilities on the GC content [12], the dependence 
of structural profiles on the GC content has not been 
investigated. We investigated the dependence on the 
GC content by applying CapR to 1,000 random RNA 
sequences of 2,000 nucleotides with a fixed maximal span 
[W - 100). Figure 2B shows how the proportions of the 
computed structural profiles depend on the GC content. 
The stem probability is high and the unstructured prob- 
ability is low with a high GC content, probably because 
the energy of the G-C pairs is larger than that of the A-U 
pairs and palindromic sequences are more likely to occur 
in the high-GC background. This result suggests that 
users should carefully interpret the results when analyzing 
RNAs with biased GC content. 

Performance of CapR 

We evaluated the speed of CapR by comparing its compu- 
tational run-time with that of Sfold. The input sequences 
were generated randomly with equal probabilities of A, 
C, G and U. For Sfold, the number of sampled structures 
was set to its default value (1,000). The computation was 
performed on an AMD Opteron 6276 2.3 GHz with 1 GB 
memory. Figure 3A shows the computational run-times, 
which depended on the maximal span W and sequence 
lengths. In all cases, CapR was much faster than Sfold. 
Sfold could not run for N > 4, 000 while CapR did for 
N = 10, 000. These results show that CapR can compute 
structural profiles for long RNAs within a reasonable time. 

Next, we evaluated the accuracy of the structural pro- 
files computed by CapR using 8,775 RNA genes that have 
experimentally validated secondary structure annotations 
in the Rfam database [22]. We set W = 800 to allow for 
stem-forming of the base pairs with the longest distance 
observed in the Rfam dataset. To estimate the accuracy 
of the structural profiles, we calculated the area under 
the receiver operating characteristic curve (AUROC) for 
each structural context. Briefly, the AUROC is high if the 
probability /?(/, 8) for the structural context 8 annotated in 
Rfam is high. 

Table 1 and Figure 3B show the AUROC values and the 
receiver operating characteristic curves, respectively. The 
AUROC value for each structural context was larger than 
0.75, indicating that the computed structural profiles were 
very consistent with the Rfam annotation. For example, 
the structural profile of transfer RNAs (tRNAs), whose 
secondary structures are well characterized, is shown in 
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Figure 2 Dependence of the structural profiles on the maximal span W and GC content. (A) Dependence of the structural profiles on the 
maximal span l/l/.Thex-axis represents the maximal span l/l/.They-axis represents the averaged p(i,8) overall the nucleotides. (B) Dependence of 
the structural profiles on the GC content. The x-axis represents the GC content. The y-axis represents the averaged ps (/) over all the nucleotides. The 
unstructured context is represented by light blue. B, bulge loop; E, exterior loop; H, hairpin loop; I, internal loop; M, multibranch loop; S, stem; U, 
unstructured. 



Figure 3C. Each line represents averaged probabilities that 
each base belongs to each structural context across all 
tRNA genes in the Rfam dataset. Probabilities of the stem, 
hairpin loop, multibranch loop and exterior loop contexts 
were high at the corresponding parts of the tRNA clover- 
leaf structure (Figure 3D). Calculated structural profiles 
are interpreted by considering that stem probabilities 
tend to be overestimated by the Turner energy model. 
In the tRNA example, the calculated stem probabilities 
were slightly higher than the multibranch loop probabil- 
ities at positions 25, 43 and 44, which are annotated as 
multibranch loops in Rfam. 

Finally, the same analysis was conducted using Sfold, 
and the accuracies of the structural profiles predicted by 
CapR and Sfold were compared. The accuracies of CapR 
were comparable to those of Sfold (Table 1). 

Datasets and methods used in the CLIP-seq data analysis 

Because it was shown that CapR is accurate in calcu- 
lating structural profiles of RNA molecules, we applied 
it to several CLIP-seq datasets to reveal the structural 
specificities of RBP-RNA interactions. For the sub- 
sequent analyses, we downloaded CLIP-seq data of 
RBP-bound RNAs from the doRina database [23], and 
selected ten RBPs: GLD-1 (nematode), QKI (human), 
Pum2 (human), SRSFl (human). Nova (mouse), Lin28A 
(mouse), FXRl (human), FXR2 (human), FMR1_7 
(human) and FMR1_1 (human) [7,24-28] (refer to 
Materials and methods for the criteria for the data 
selection). FMR1_7 and FMR1_1 are two splicing iso- 
forms of FMRl. RBPs with two known sequential motifs 
(FXRl, FXR2, FMR1_7 and FMR1_1) were analyzed sep- 
arately for each of the motifs. Hereafter, these cases are 



represented by the protein names with their sequential 
motifs: FXRl(ACUK), FXRl(WGGA), FXR2(ACUK), 
FXR2(WGGA), FMR1_7(ACUK), FMR1_7(WGGA), 
FMR1_1(ACUK) and FMR1_1(WGGA). 

We created one positive dataset and two negative 
datasets for each of these 14 cases. The positive dataset 
was a collection of transcribed sequences of ±2,000 
nucleotides around each RBP-bound site. The RBP-bound 
sites were defined as sites of sequential motifs within 
the CLIP-seq peak regions. The two negative datasets 
are referred to as the unbound and shuffled datasets. 
The unbound dataset was a collection of transcribed 
sequences of ±2,000 nucleotides around a sequential 
motif site that was in the same transcriptional unit and 
within ±1,000 nucleotides of any RBP-bound site, but 
was not an RBP-bound site. In short, this dataset repre- 
sents the sequential motif sites that are transcribed but 
unbound by the RBP. The shuffled dataset was generated 
by randomly shuffling each of the upstream and down- 
stream sequences of each RBP-bound site by preserving 
nucleotide di-nucleotide frequencies for every sequence 
in the positive dataset. Thus it represents the sequential 
motif sites flanked by sequences with preserved sequence 
compositions. The details of the datasets are described in 
the Materials and methods section. 

We calculated the structural profiles of the positive, 
unbound and shuffled datasets for each of the RBPs 
{W = 200). Then, to evaluate the structural contexts that 
are significant in the positive dataset statistically, we 
defined a P score as follows. First, we calculated a P 
value using the one-sided Wilcoxon-Mann- Whitney test 
for each side for each position. Second, we selected the 
smaller P value of the two hypotheses and transformed it 
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Figure 3 Performance of CapR. (A) Computational run-times for different values of maximal span 1/1/ and sequence length A/.Thex-axis represents 
the sequence length H. The y-axis represents the computational run-time. (B) The receiver operating characteristic curve for each loop context. The 
X-axis represents 1 -specificity and the y-axis represents the sensitivity. The specificity and sensitivity are defined as true positive/(true positive + false 
negative) and true negative/(true negative -h false positive), respectively. (C) The structural profiles of tRNAs.Thex-axis represents the nucleotide 
positions from 5' to 3'. The y-axis represents averaged probabilities that each base belongs to each structural context across all tRNA genes in the 
Rfam dataset [22]. The black boxes represent the nucleotides annotated as stem in Rfam. (D) tRNA cloverleaf structure annotated in Rfam. B, bulge 
loop; E, exterior loop; H, hairpin loop; I, internal loop; M, multibranch loop; S, stem. 



into — log^o ^> which we designated the P score. Third, if a 
P score was calculated under the hypothesis that each con- 
text probability of the positive dataset was smaller than 
that of the negative dataset, we changed the sign of the P 
score. For example, a large positive P score indicates that 
the probability of that structural context is significantly 
larger in the positive dataset. Finally, the two P scores cal- 
culated for the two negative datasets were compared for 
each position, and the smaller P score was taken (if one 
P score was positive and the other was negative, we used 
0 instead of the two P scores). Note that the Bonferroni 



Table 1 AUC score of each structural context 



Software 


Bulge 


Exterior 


Hairpin 


Internal 


Multibranch 


Stem 


CapR 


0.847 


0.866 


0.890 


0.765 


0.852 


0.805 


Sfold 


0.842 


0.817 


0.890 


0.769 


0.853 


0.804 



correction was used for multiple testing. To avoid the 
effects of the artificial value selection for the parameter W, 
we used the unstructured context instead of the exterior 
and multibranch loop contexts in the following analysis. 
We confirmed that the choice of W actually did not affect 
the results (Additional file 1: Figure S2). 

Specific RNA structural contexts recognized by 
RNA-binding proteins 

We investigated the preferred RNA structural contexts 
for each RBP and revealed that most RBPs prefer a spe- 
cific structural context (Figure 4 and Additional file 1: 
Figure S3). Our method was robust regarding the selec- 
tion of the negative datasets, because selecting the larger 
P scores did not affect the results overall (Additional file 1: 
Figures S4 and S5). Among the 14 cases analyzed, six cases 
showed a preference for the unstructured context (GLD-1, 
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Figure 4 The distribution of tiie P scores for eachi RNA-binding protein. The x-axis represents the nucleotide positions and the y-axis represents 
the P score of ±20 bases around the sequential motif site. The position 0 denotes the start position of the sequential motif. Positive P scores for each 
structural context indicate that the positions tend to prefer the structural context. The black box represents the sequential motif site. The dotted 
lines show the corrected significance levels of the Bonferroni correction {a = 0.05). The panels represent the distribution of P scores for (A) QKI, (B) 
Pum2, (C) Lin28A, (D) FXR2(WGGA), (E) FMR1_7(ACUK), (F) FXR2(ACUK), (G) Nova and (H) SRSFl . B, bulge loop; H, hairpin loop; I, intemal loop; S, 
stem; U, unstructured. 



Fukunaga etal. Genome Biology 20^ 4, 15:R16 
http://genomebiology.com/201 4/1 5/1 /R1 6 



Page 7 of 1 5 



QKI, SRSFl, Nova, FXRl(ACUK) and FXR2(ACUK)). 
Except for Nova, the RBP-bound sites tended to form the 
unstructured context, but did not show preferences for 
the bulge, internal or hairpin loop contexts (Figure 4A and 
Additional file 1: Figure S3). It should be noted that these 
results could not be obtained by analyzing the accessi- 
bility alone, which does not discriminate between these 
non-stem contexts. 

Pum2 showed a preference for the hairpin loop context 
(Figure 4B). To our knowledge, this is the first report of 
the structural preference for the hairpin loop context by 
Pum2, which is known to be involved in germ cell devel- 
opment [29]. Lin28A showed preferences for the hairpin 
and internal loop contexts (Figure 4C). Lin28A is known 
to inhibit the maturation of let-7 miRNAs and the trans- 
lation of mRNAs that are destined for the endoplasmic 
reticulum [27]. The specificity of Lin28A to the hairpin 
loop context is consistent with the previous study [27]. 
In addition, our result is the first to suggest that Lin28A 
prefers the internal loop context in mRNA binding, and 
Lin28A has been reported to bind to the internal loop of 
let-7 miRNAs [27]. 

FXRl(WGGA), FXR2(WGGA) and FMR1_7(WGGA) 
showed preferences for the stem context (Figure 4D and 
Additional file 1: Figure S3), although RBPs were con- 
sidered to be unlikely to be bound to the stem regions 
of RNAs as already mentioned. These three RBPs (and 
FMR1_1) are members of the FMRP family and are known 
to be responsible for the fragile X syndrome. Darnell 
et al showed that FMRP-bound WGGA sites tend to 
form a G-quadruplex, which is composed of guanine-rich 
sequences forming a four-stranded RNA structure [30]. 
We suppose that the preference for the stem contexts 
could reflect the tendency that these family members rec- 
ognize the G-quadruplex; however, this should be investi- 
gated further as currently our energy model and grammar 
cannot deal with G-quadruplexes. 

FMR1_7(ACUK) showed preferences for the internal 
and bulge loop contexts (Figure 4E). To our knowl- 
edge, this is the first report of the structural specificities 
of FMRl. In contrast, FXR2(ACUK), where FXR2 is a 
homolog of FMRl, preferred neither the internal nor 
bulge loop context (Figure 4F). FMR1_7 has an exon inser- 
tion in its K homology domain that recognizes the ACUK 
sequential motifs [28]. This insertion appears to under- 
lie the differences in the structural specificity between 
FMR1_7(ACUK) and FXR2(ACUK). 

Positional preferences in the RNA structure recognition by 
RNA-binding proteins 

The present understanding of the structural specifici- 
ties of RBP-RNA interactions overlooks structures of 
the flanking sequences of RBP-bound sites. Therefore, 
we investigated the secondary structures not only of the 



RBP-bound sites but also of their flanking sequences. 
In fact, the positions with the highest P scores were 
not within the RBP-bound sites in some RBPs. QKI 
(Figure 4A), Nova (Figure 4G) and SRSFl (Figure 4H) 
preferred the unstructured context. High P scores were 
observed within the RBP-bound sites for SF2ASF, whereas 
they were observed in the flanking and upstream 
sequences for QKI and Nova, respectively. These results 
suggest that RBPs also recognize specific structures exist- 
ing outside of sequential motif sites, and CapR can 
uncover these positional preferences from ribonomic 
datasets. 

Figure 5A,B shows the nucleotide compositions around 
the RBP-bound sites of QKI and Nova. The flanking 
sequences of QKI-bound sites were guanine poor, whereas 
those of Nova-bound sites were uracil rich. Because 
sequences with low GC content tend to form an unstruc- 
tured context, the aforementioned positional preferences 
could be generated by the biased nucleotide composi- 
tions. To address this possibility, we investigated the rela- 
tions between the nucleotide compositions and structural 
specificities in the flanking sequences. We generated par- 
tially shuffled datasets by randomly shuffling sequences 
outside of the ±5 or 10 nucleotides of the RBP-bound 
sites with preserving di-nucleotide frequencies, and com- 
pared their structural profiles with those of the posi- 
tive datasets using the Wilcoxon-Mann- Whitney test. 
Then, the P scores for the shuffled and partially shuf- 
fled datasets were compared (Figure 6A,B). For QKI, 
whereas the shuffled dataset had positional preferences 
in the flanking sequences, the partially shuffled datasets 
had no significant preferences. This means that the struc- 
tural specificities of QKI could be generated by the biased 
nucleotide compositions in the flanking sequences. For 
Nova, the partially shuffled datasets still had significant 
P scores upstream of the RBP-bound sites. Therefore, 
the nucleotide compositions in the flanking sequences 
alone cannot generate the positional specificities of Nova, 
that is, sequences in distant regions could also con- 
tribute to the position-specific RNA binding of Nova. The 
nucleotide compositions around the RBP-bound sites and 
the analyses of the partially shuffled datasets of other 
RBPs are described in Additional file 1: Figures S6 and S7, 
respectively. 

Discussion 

In this study, we developed an efficient algorithm that cal- 
culates the structural profiles of RNAs, and implemented 
it as CapR. It is the fastest software that can be applied to 
tens of thousands of long RNAs. 

Using CapR, we investigated structural specificities of 
RBP target recognition using several CLIP-seq datasets. 
Our analysis revealed that most RBPs prefer specific 
structural contexts and some RBPs show positional 
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Figure 5 The nucleotide compositions around the RBP-bound sites. The nucleotide compositions of ±20 bases around the RBP-bound sites for 
(A) QKI and (B) Nova. The x-axis represents the nucleotide position and they-axis is the probability of each nucleotide. The black box represents the 
sequential motif site. 



preferences in their structural recognition. These findings 
could provide insights into the mechanisms of diseases 
involving RBPs. FMR1_7, where FMRl is a causative gene 
of the fragile X syndrome, was revealed to bind specifically 
to internal and bulge loops. The observed structural speci- 
ficity raises the possibility that disruption of the internal 
or bulge loop structures within the target sites of FMR1_7 
may cause this disease. On the other hand, the structural 
specificities of Nova were revealed to be affected by the 
sequences of distant regions. This means that a muta- 
tion of a nucleotide distant from the RBP-bound sites 
can cause changes to the secondary structures around 
the RBP-bound sites. Because some disease-associated 
single nucleotide polymorphisms in non-coding regions 



are reported to affect RNA secondary structures [31,32], 
CapR could also contribute to exploring disease mecha- 
nisms behind such polymorphisms. 

It has been shown that the secondary structures around 
the target sites of small interfering RNAs (siRNAs) and 
miRNAs influence their activities [33,34]. Kiryu et al, 
showed that the activity of an siRNA depends on the 
accessibility of the 3^ end of the siRNA target site, and 
Marin et al. showed that the 3^ end of an miRNA target site 
is more accessible than the other positions [12,35]. As sup- 
ported by the X-ray crystal structure of the guide-strand- 
containing Argonaute [36], these positional tendencies 
in the accessibility can reflect the kinetic aspects of the 
siRNA and miRNA binding mechanisms. We hypothesize 



A QKI B Nova 




position position 

Figure 6 Comparison of P scores of the positive datasets with P scores of the shuffled and partially shuffled datasets. In the legend of this 
figure, '0', '5' and '10' represents the shuffled, the partially shuffled (±5) and the partially shuffled (±10) datasets, respectively. The x-axis represents 
the nucleotide position and they-axis is the P score of (A) QKI and (B) Nova. The black boxes are the RBP-bound sites, and the horizontal dotted 
lines the corrected significance levels of the Bonferroni correction. The vertical dotted lines indicate the ±5 or 1 0 nucleotides of RBP-bound sites. 
RBP, RNA-binding protein. 
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that the positional preferences of RBPs discovered in this 
study also reflect the kinetic aspects of the RBP-RNA 
interactions. For example, Nova had a positional pref- 
erence for upstream of the sequential motif site in the 
unstructured context recognition. In fact, the co-crystal 
structure of human Nova with the target RNA (PDBID: 
1EC6) [37] showed that the area upstream of the sequen- 
tial motif site interacts with the C-terminal amino acids of 
Nova [38] (see Figure 7; note that the CLIP-seq data were 
for a highly similar ortholog, mouse Nova). In addition, 
the deletion of these C-terminal amino acids inhibits the 
RNA binding function of Nova [39] . Therefore, the posi- 
tional preference does likely reflect the kinetic aspects of 
the RNA binding function of Nova. We argue that this 
example demonstrates the potential power of ribonomic 
analysis. 

Three future perspectives are envisioned based on 
the present study. The first perspective is to estimate 
the sequential and structural specificities simultaneously. 
Throughout this study, we focused on the RBPs with 
known and well-defined sequential motifs. Nonetheless, 
for several RBPs, no such sequential motifs have been 
identified (for example, FET binds to a highly flexible 
UAN^Y motif within the hairpin context [16]). To exam- 
ine the binding specificities of these RBPs, CapR needs 
to be extended. The second perspective is prediction of 



RBP-bound sites. Li et al. showed that prediction of RBP- 
bound RNAs in vivo was improved by a motif-finding 
algorithm that considers accessibility [10]. Thus, consid- 
eration of structural profiles may also improve the pre- 
diction of RBP-bound sites in vivo, although we did not 
directly show this in the present study. Further investiga- 
tion is necessary for evaluating whether discrimination of 
RBP-binding sites from a background sequence would be 
improved using the structural specificities of RBP target 
recognition. Other factors or subcellular localizations also 
need to be considered. The third perspective is application 
of CapR to functional RNAs. For example, the kissing hair- 
pin, which is a hairpin-hairpin interaction that stabilizes 
RNA structures [40], may be predicted accurately using 
CapR because CapR enables the calculation of the hairpin 
loop probabilities. Another target would be small nucleo- 
lar RNAs (snoRNAs), where the detection algorithms still 
have room for improvement [41]. Because snoRNAs are 
characterized by specific internal loops, they may also be 
predicted accurately by taking advantage of the accurate 
calculation of internal loop probabilities by CapR. 

Conclusions 

We developed a highly efficient algorithm that calcu- 
lates the probabilities that each RNA base position is 
located within each secondary structural context for tens 




Figure 7 Co-crystal structure of Nova and the target RNA. This figure was generated using Pymol. Tine ten amino acids of tine C-terminal tail are 
shown in red. RNA is represented by green sticks. The positions and the nucleotides are shown in yellow. Position 1 is the start position of the 
sequential motif. 
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of thousands of RNA fragments. The algorithm was 
implemented as software named CapR and was applied 
to the CLIP-seq data of various RBPs. Our algorithm 
demonstrated that several RBPs bind to their target RNA 
molecules under specific structural contexts. For exam- 
ple, FMRl, which is an REP responsible for the fragile 
X syndrome, was found to bind specifically to the inter- 
nal and bulge loops of RNA, Another example is Nova, 
a neuron-specific RBP related to a paraneoplastic neuro- 
logic disorder, which showed positional preference in the 
structural contexts of binding targets. 

Secondary structures are known to be essential for 
the molecular functions of RNA, As large-scale, high- 
throughput approaches are becoming more popular in 
studying RNAs and RBPs, our algorithm will contribute 
to the systematic understanding of RNA functions and 
structure-specific RBP-RNA interactions. 

Materials and methods 

Rfold model 

The state transition rules of the Rfold model are given by 

Outer — > 6 1 Outer • a\Outer • Stem 
Stem — > Z?< • Stem • b> |Z?< • StemEnd • Z?> 
StemEnd — > Sn\Sm • Stem - Sn{m -\- n > 0)|Multi 

Multi — > a ' Multi|MultiBif 
MultiBif — > Multil • Multi2 
Multil — > MultiBif I Multi2 
Multi2 — > Multi2-^|Stem 

where e represents the null terminal symbol, a is an 
unpaired nucleotide character, s/^ is an unpaired base 
string of length k and {b<, b>) is a base pair. There 
are seven non-terminal symbols: Outer, Stem, StemEnd, 
Multi, MultiBif, Multil and Multi2. Outer emits exte- 
rior bases. Stem emits all the base pairs. StemEnd rep- 
resents the end of each stem from which a hairpin 
loop (StemEnd — > Sn)> and internal and bulge loop 
(StemEnd — > Sm • Stem - Snirn -\- n > 0)), or a multi- 
branch loop (StemEnd — > Multi) is emitted. Multi 
represents a complete multibranch loop. Multil, Multi2 
and MultiBif represent parts of a multibranch loop struc- 
ture that contains one or more, exactly one, and two or 
more base pairs in the loop, respectively. Based on this 
grammar, the structural profiles are calculated by using 
a variant of the inside-outside algorithm for SCFG. First, 
we give an illustrative example to show how to calcu- 
late the internal loop probabilities from the inside and 
outside variables (/,;') and psiUj) {i>J = 0, . . .,A/', 5 g 
{Outer,Stem,StemEnd,Multi,MultiBif,Multil,Multi2}), In 
the subsequent section, we completely describe how to 
calculate structural profiles. 



Algorithm for calculating internal loop probabilities 

When a base at position / has an internal loop con- 
text, the base / is caught in two base pairs, (/, k) and 
{p, q) where j < p < q < k (Figure 8), Then, the out- 
side structure of base pair (/, k) and the inside structure 
of base pair {p, q) may take arbitrary structures. The sums 
of Boltzmann weights of all patterns of the outside struc- 
ture of base pair (/, k) and the inside structure of base pair 
{p, q) are represented by outside variable PstemEnd(j> ^< — 1) 
and inside variable astemip — respectively. There- 

fore, Boltzmann weights that the base / is caught in two 
base pairs (/, k) and {p, q) are obtained by the multiplica- 
tion of PstemEnd(j> f< ~ 1)> the score for transition StemEnd 
(/, k — I) ^ Stemip — 1, q), and astem(p — 1, q). Here, we 
sum these Boltzmann weights for all combinations of base 
pairs (/, k) and {p, q). Finally, we obtain p(i,I) by dividing 
the sum by the partition function. 




astem(p5q) 



t(StemEnd 
^Stem) 



PstemEnd(j ^k) 



Figure 8 Schematic illustration of calculation of internal loop 
probability. This figure sliows tine transition patterns tliat emit an 
internal loop. This figure was generated by modifying the output of 
VARNA [42]. 
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The calculation formulas are given by: 



W(ij) = WlnternalLefttt^) + ^InternalRightO'.^) 
l^InternalLeftC^'^^) = 



/ mm(n,j-\-W) min(/+C+l,/c-l) k 

E E E E 

;=max(l,/-W^) k=i-\-l p=i+l ^=max(^+4,/:-C-/7+;-l) 



^InternalRight(^^) — 



i^StemEnd(/> k - I) - astemip - h q) ' ^(StemEnd (Interior) Stem) 

/ min(n,j-\-W) min(/+C+l,/-l) i 

E E E E 

y^maxCl,^-^^) k=i-\-l P=j-^'^ q=mdix(p-{-4,k-C-p-\-J-l) 



PstemEndijy k - 1) • Of stem (p - h q) ■ ^(StemEnd (Interior) Stem) 
p(iyl) = w(iyI)/Z(x) 



where ^(5 s^) is the score for transition 5^5^ and C is the maximal length of the internal and bulge loops. Many 
software programs, including RNAfold [43], adopt this parameter. In this study, following the default setting of RNAfold, 
we set C = 30. 



Algorithms for calculating the structural profile 
The inside algorittim and the outside algorithm 

To calculate the inside and outside variables, we developed a variant of the inside-outside algorithm corresponding to 
the Rfold model. The inside algorithm is described as follows: 



«^Multibif(^>;) = 
^>^Multi2(^>;) = 
«^Multil(^;) = 



Stem) 
StemEnd) 



El o^stem(^ + hj - 1) • ^(Stem 
1 oistemii + hj - 1) • ^(Stem 

El oiM\xltii(i> k) • <^Multi2(^»/) ' ^(MultiBif Multil • Multi2) 
yioT i < k < j 



El (^stemihj) • t(Multi2 Stem) 
I oiMMimihj - 1) • ^(Multi2 Multi2) 

El oiMM\ti2{hj) ' ^(Multil Multi2) 
1 otu^x\tm^{hj) • ^(Multil MultiBif) 



E 

^>^StemEnd(^J*) = ^ 
Of Outer (0 = 



C^Multi(^' + hj) • ^(Multi Multi) 
c^MultiBif(i,;) • ^(Multi MultiBif) 

^(StemEnd (Hairpin)) 
cistern (^^/) • ^(StemEnd (Interior) -> Stem) 
for / < i' < f < 0 < (j - f) + (/^ - i) < C 
oiMuki(i>j) ' ^(StemEnd Multi) 

lif; = 0 

Qfouter(^ — 1) • ^(Outer Outer) 

oiO\xtev(k) ' olstemik, i) • t(Outer Outer • Stem) 

for (i-W) <k< i 
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The outside algorithm is described as follows: 



^Outer(0 = ^ 



^StemEnd(^V) 
^Multi(^;) 

^Multil(^'j*) = 



lif i = N 

Poixtevii + 1) • tiOwtev Outer) 
oistemiUk) ' PouteM ' ^(Outer Outer • Stem) 
ior i < k < W 
Pstem(i — 1,/ + 1) • ^(Stem StemEnd) 

El i^stemEnd(^j') ' ^(StemEnd Multi) 
1 i^Multi(/ - hj) • ^(Multi ^ Multi) 



^MultiBif(^'j') = ^ 



\ i^MuitiBif(i\ k) ' oiMuXmij.k) • ^(MultiBif Multil • Multi2) 

for j <k< (i + W) 



i^Multi2(i\; + 1) • ^(Multi2 Multi2) 
pMnitiiiiJ) • ^(Multil Multi2) 

y^MuitiBif (/<,;) • c^Muitii(^, 0 • ^(MultiBif Multil • Multi2) 

for (j-W) <k < i 

pMix\tii(i>J) ' ^(Multil MultiBif) 
y^Multi(^;) • ^(Multi MultiBif) 

oiOixteriO • Ponterij) ' ^(Outer Outer • Stem) 
y^stemEnd(^^/) ' ^(StemEnd (Interior) Stem) 
for <i<j < /, 0 < (/ - /O + a-f) < C 
y^Multi2(^>;) • ^(Multi2 Stem) 
i^stem(^ - hj + 1) • ^(Stem Stem) 



The original computational complexity of both algorithms is 0{NW^); because we adopted the parameter C, it 
becomes 0{NW^) as described below. 

Calculation of the structural profile 

We calculate the structural profiles from the inside and outside variables computed by the inside-outside algorithm. 
The calculation formula is described as follows: 



Z 

p{UB) 



otoiN) 

/ / min(;^,/+W^) min(/+C+l,A:-l) 

z\ H H 

\/=max(l,!-W') k=i+l p=i+l 

Pse(J, k-l)-as(p-l,k-l)- t(SE -> (Interior) S) 

i min(n,j-\-W) i 

+ E E E 

j=max(l,i-W) k=i+l q=max(j+4,k-C-l) 

PsE(J.k- 1) • as(J,q) ■ f(SE ^ (Interior) ^ S) ) 



p{i,E) = - (aoii - 1) • /So(0 • t{0 ^ O)) 



^ i-l k=m\n(n,i+W) 

p(i,H) = ;^ E E PsE(j,k- 1) ■ ^(SE ^ (Hairpin)) 

7=max(l,;-W) 
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/ / min(«,;+Pr) min(/+C+l,^-l) k 

/^0'^) = | E E E E 

\J=m2ixiU-W) k=i-\-l p=i+l ^=max(/7+4,/:-C-/7+;-l) 

i^SEO; k-l)'as{p-hq)' ^(SE ^ (Interior) ^ S) 

/ min(«,7+W') min(/+C+l,/-l) / 

+ E E E E 

PsE(j>k -l)'as(p-hq)' t(SE (Interior) ^ S) j 
v(i M) = -\ Erif ^""''^ ^m(/ - 1, )^) • aud. k) . ^(M ^ M) 

^ [ Ekmax(0,-Vr) i^M2(/, ^) ' OtMliK / " D • ^(M2 ^ M2) 

' — 1) • t(S ^ SE) 



1) . t{S S) 



Here, O is the outer state, S is the stem state, SE is the 
stem-end state, M is the multi state and M2 is the multi2 
state in the Rfold model. 

Implementation 

We implemented the algorithms in C++ as a program 
named CapR. CapR exhaustively computes the structural 
profile {p{U8)} for a given RNA sequence with 0{NW^) 
time and 0{NW) memory. We used a portion of the 
source code from the Vienna RNA package [43]. We 
include the source code as Additional file 2. Our source 
code is also available from [44]. 

Data preparation and analysis 

To evaluate the accuracy of the structural profiles cal- 
culated by CapR, we used 188 structural RNA families 
in the Rfam 10.0 seed dataset [22]. They are provided 
as 188 structural alignments with experimentally vali- 
dated pseudoknot-free structures. By excluding alignment 
columns with a gap proportion of >0.5, we obtained 8,775 
sequences and 1,039,537 nucleotides. 

In the present study, we focused on RBP target recog- 
nition. In this application, it should be ineffective to 
consider transcribed sequences that are too long because 
regions that are too distant are unlikely to affect the sec- 
ondary structures around the RBP-bound sites, although 
our algorithm itself can be applied to long RNAs. There- 
fore, we investigated how much distance we should take 
into account. We prepared 100 random RNA sequences 
10,100 nucleotides long and truncated them so that the 
lengths of the flanking sequences of the central 100 bases 
became / = 250, 500, . . ., 2,500. Then, we calculated the 
structural profiles of the central 100 bases for each /, and 
calculated the Pearson correlation coefficient between 
the structural profiles of the original sequence and those 
of the truncated sequences. Additional file 1: Figure S8 



shows that the Pearson correlation coefficients were more 
than 0.99 for / > 2, 000. Therefore, we considered 2,000 
nucleotides upstream and downstream of the RBP-bound 
sites in this study. 

To investigate the structural characteristics of RNAs 
around the RBP-binding sites, we downloaded CLIP-seq 
datasets from the doRina database [23] (human [45], 
mouse [46] and nematode [47]). We excluded from the 
analysis CLIP-seq datasets that met one of the following 
three criteria: (1) well-defined sequential motifs not pre- 
sented in the original paper of the dataset, (2) datasets 
for mutant RBPs and (3) the average number of RBP- 
bound sites (that is the sequential motif-matched sites 
within the CLIP-seq peak regions defined in doRina) is 
less than two. The third criterion was adopted because 
many RBP-bound sites include false positives. As a 
result, we selected ten RBPs: GLD-1 (nematode), QKI 
(human), Pum2 (human), SRSFl (human). Nova (mouse), 
Lin28A (mouse), FXRl (human), FXR2 (human), FMR1_7 
(human) and FMR1_1 (human) [7,24-28]. When the peak 
regions spanned just one or two bases, we sought sequen- 
tial motif-matched sites within ±10 nucleotides around 
the peak regions. If no motif-matched sites were found, 
such peak regions were excluded from the analysis. Then, 
we extracted dz2, 000 nucleotide sequences around the 
RBP-bound sites to create the positive datasets. If there 
existed multiple RBP-bound sites in the same peak region, 
we averaged the structural profiles around those sites and 
used them as a single observation. For each gene in RefSeq 
[48], the transcribed sequence was defined by the genomic 
region between the most upstream 5' position and the 
most downstream 3^ position of its mRNA isoforms. To 
generate the shuffled and partially shuffled datasets, we 
used the uShuffle software to preserve the di-nucleotide 
frequencies of the original sequences [49]. The data sizes 
and other basic statistics of the CLIP-seq datasets are 
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summarized in Additional file 1: Tables SI and S2. In the 
present study, because the distributions of the structural 
profiles did not follow a normal distribution, we used the 
non-parametric Wilcoxon-Mann- Whitney test. 

We also examined how the choice of the maximal span 
W influences the results. We compared the highest P 
scores of the exterior and multibranch loops with different 
W because these two loops are sensitive to W, We calcu- 
lated the ratios of the W sensitivity {8) of the highest P 
scores among all positions for each loop 8 calculated at 
W = 400 and 30: 



^vr '4.' '4. / c>\ Highest P score for 8 SitW — 
\^SenSltlVlty(5) = Highest P score for 3 at ^ 



400 
30 



Additional file 1: Figure S9 is a box plot of the W sensi- 
tivity of the exterior loop, multibranch loop and unstruc- 
tured contexts for all the RBP datasets. The highest P 
scores of the exterior and multibranch loops were sensi- 
tive to W, whereas the highest P score of unstructured 
context was insensitive to W. 

Notes added in proof 

After the manuscript was accepted, we were informed that 
the similar algorithm to CapR was internally used in the 
previous researches [50-52]. 

Additional files 



Additional file 1 : Supplementary materials. This file includes additional 
figures and tables not shown in the manuscript. 

Additional file 2: The source code of CapR. This file includes the source 
code of CapR. 
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